
class: center, middle, hide-logo

# First Machine Learning Workshop

## by

##### Author/Presenter: Ruben Ernst/Mathias Steilen
##### Last updated: _2022-11-24 11:47:23_

---

### Today's Mission

.center[
<img src="GraphicsSlides/get in loser.png" width="60%" />
]

Courtesy of the TidyTuesday project - Check it out!

---

# Background

> The Great Pumpkin Commonwealth's (GPC) mission cultivates the hobby of growing giant pumpkins throughout the world by establishing standards and regulations that ensure quality of fruit, fairness of competition, recognition of achievement, fellowship and education for all participating growers and weigh-off sites.

_[Link to Website](https://gpc1.org/)_

.center[
<img src="GraphicsSlides/GPCJoinPagecoop.jpg" width="50%" />
]

---

# Let's look at the files

.panelset[

.panel[.panel-name[training]

```r
training <- read_csv("./Data/training.csv")
```

```
## Rows: 8745 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): type, grower_name, city, state_prov, country, gpc_site
## dbl (6): year, place, weight_kg, ott, est_weight, id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

This file will be used for training/fitting your model.

]

.panel[.panel-name[testing]

```r
holdout <- read_csv("./Data/holdout.csv")
```

```
## Rows: 2187 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): type, grower_name, city, state_prov, country, gpc_site
## dbl (5): year, place, ott, est_weight, id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

You will make your predictions on this file. It contains no target variable (`weight_kg`), so it cannot leak into training. Before submitting your predictions, make sure they follow the sample submission format.

]

.panel[.panel-name[sample_submission]

```r
sample_submission <- read_csv("./Data/sample_submission.csv")
```

```
## Rows: 2187 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): id, weight_kg
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

**Important**: Your submission to our email address must adhere to this format (CSV file).
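
A quick self-check before sending could look like the base R sketch below. `check_submission()` is a hypothetical helper, not part of the workshop material, and the toy data frames only stand in for the real files; the sketch assumes nothing beyond the two columns `id` and `weight_kg` shown above.

```r
# Hypothetical helper: compare a submission against the sample format.
check_submission <- function(submission, sample) {
  stopifnot(
    identical(names(submission), names(sample)),  # same columns, same order
    nrow(submission) == nrow(sample),             # one row per holdout id
    setequal(submission$id, sample$id),           # every id is present
    !anyDuplicated(submission$id),                # no id appears twice
    is.numeric(submission$weight_kg),             # predictions are numbers
    !anyNA(submission$weight_kg)                  # no missing predictions
  )
  invisible(TRUE)
}

# Toy data standing in for sample_submission and your own predictions
sample_toy <- data.frame(id = c(3, 1, 2), weight_kg = 0)
my_toy     <- data.frame(id = c(1, 2, 3), weight_kg = c(410.2, 388.9, 502.1))

check_submission(my_toy, sample_toy)  # errors if any check fails
```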

]

]

---

# Our Basic Example

**Disclaimer**

.pull-left[

Some of you might feel like this:

.center[
<img src="GraphicsSlides/sad-cry.gif" width="75%" />
]

]

.pull-right[

And some of you might feel like this:

]

The learning curve is always steep when looking at it from the bottom. Use the time later to ask your more experienced peers (or us) questions.

---

#### Our Basic Example: The ol' reliable

First, split the training data into a training and a testing part, and create folds for resampling:

```r
dt_split <- initial_split(training)
dt_train <- training(dt_split)
dt_test <- testing(dt_split)

folds <- vfold_cv(dt_train, v = 5) # resampling for tuning
```

```r
dt_split
```

```
## <Training/Testing/Total>
## <6558/2187/8745>
```

---

#### Our Basic Example: The ol' reliable

Let's fit a basic linear regression with a penalty (an elastic net):

```r
lin_spec <- linear_reg(mixture = tune(), penalty = tune()) %>%
  set_mode("regression") %>%
  set_engine("glmnet")
```
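
For reference, this is the objective that glmnet minimises for a numeric outcome, as we understand it from the glmnet documentation. The notation is ours and does not appear elsewhere in these slides: `penalty` plays the role of $\lambda$ and `mixture` the role of $\alpha$.

$$
\min_{\beta_0,\,\beta}\ \frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^\top\beta\right)^2
+\lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1\right]
$$

With `mixture = 1` this is the lasso, with `mixture = 0` it is ridge regression, and values in between blend the two.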

```r
lin_rec <- recipe(weight_kg ~ year + place + ott + est_weight + country,
                  data = training) %>%
  step_impute_mean(all_numeric_predictors()) %>%  # fill missing numerics with the mean
  step_novel(all_nominal_predictors()) %>%        # allow levels unseen in training
  step_unknown(all_nominal_predictors(), new_level = "not specified") %>%  # label missing categories
  step_other(country, threshold = 0.03) %>%       # pool countries below 3% into "other"
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_rm(country_other)                          # drop the pooled level's dummy column
```

---

#### Our Basic Example: The ol' reliable

```r
lin_rec %>% prep() %>% juice()
```

```
## # A tibble: 8,745 × 11
##     year place   ott est_weight weight…¹ count…² count…³ count…⁴ count…⁵ count…⁶
##    <dbl> <dbl> <dbl>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  2013   356  940.       241.     453.       0       0       0       0       0
##  2  2013   356  904.       451.     453.       0       0       0       0       0
##  3  2013   358  917.       241.     453.       0       0       0       0       0
##  4  2013   359  892.       434.     453.       0       0       0       0       0
##  5  2013   361  917.       241.     453.       0       1       0       0       0
##  6  2013   363  706.       241.     452.       0       0       0       0       0
##  7  2013   363  874.       408.     452.       0       0       0       0       0
##  8  2013   363  963.       241.     452.       0       1       0       0       0
##  9  2013   366  879.       416.     451.       0       1       0       0       0
## 10  2013   368  706.       241.     451.       0       0       0       0       0
## # … with 8,735 more rows, 1 more variable: country_United.States <dbl>, and
## #   abbreviated variable names ¹weight_kg, ²country_Austria, ³country_Canada,
## #   ⁴country_Germany, ⁵country_Italy, ⁶country_Japan
```

---

#### Our Basic Example: The ol' reliable

.scroll-output[

```r
lin_wf <- workflow() %>%
  add_recipe(lin_rec) %>%
  add_model(lin_spec)

lin_wf
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_impute_mean()
## • step_novel()
## • step_unknown()
## • step_other()
## • step_dummy()
## • step_rm()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = tune()
## 
## Computational engine: glmnet
```

]

---

#### Our Basic Example: The ol' reliable

Let's tune the penalty and the mixture:

```r
lin_tune_results <- tune_grid(
  lin_wf,
  resamples = folds,
  grid = grid_regular(penalty(),
                      mixture(),
                      levels = 10)
)
```
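
To get a feeling for how much work this is, the grid can be inspected on its own. A small sketch, assuming only the `dials` package that `tidymodels` loads (`toy_grid` is a name chosen for illustration):

```r
library(dials)

toy_grid <- grid_regular(penalty(), mixture(), levels = 10)

nrow(toy_grid)           # 10 x 10 = 100 candidate combinations
range(toy_grid$penalty)  # penalty values are spaced on a log10 scale
```

Each of these combinations is evaluated on every fold, so larger grids or more folds quickly increase the run time.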

---

#### Our Basic Example: The ol' reliable

.scroll-output[

Let's look at the results:

```r
lin_tune_results %>% 
  show_best(metric = "rsq")
```

```
## # A tibble: 5 × 8
##         penalty mixture .metric .estimator  mean     n std_err .config          
##           <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>            
## 1 0.0000000001    0.111 rsq     standard   0.870     5 0.00374 Preprocessor1_Mo…
## 2 0.00000000129   0.111 rsq     standard   0.870     5 0.00374 Preprocessor1_Mo…
## 3 0.0000000167    0.111 rsq     standard   0.870     5 0.00374 Preprocessor1_Mo…
## 4 0.000000215     0.111 rsq     standard   0.870     5 0.00374 Preprocessor1_Mo…
## 5 0.00000278      0.111 rsq     standard   0.870     5 0.00374 Preprocessor1_Mo…
```

]

---

#### Our Basic Example: The ol' reliable

Let's finalise the model with the best parameters from tuning:

```r
lin_fit <- lin_wf %>% 
  finalize_workflow(select_best(lin_tune_results, metric = "rsq")) %>% 
  fit(dt_train)
```

This fits the finalised workflow on the training split only, so the testing split stays untouched for evaluation.

---

#### Our Basic Example: The ol' reliable

```r
lin_fit %>% 
  predict(dt_test)
```

```
## # A tibble: 2,187 × 1
##    .pred
##    <dbl>
##  1  424.
##  2  488.
##  3  423.
##  4  424.
##  5  494.
##  6  441.
##  7  439.
##  8  439.
##  9  307.
## 10  484.
## # … with 2,177 more rows
```

---

#### Our Basic Example: The ol' reliable

```r
lin_fit %>% 
  augment(dt_test) %>% 
  rsq(truth = weight_kg, estimate = .pred)
```

```
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.860
```
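
As a reminder of what this number means: `rsq()` in yardstick is the squared correlation between truth and prediction, so it does not tell you how far off the predictions are in kilograms. A base R sketch on made-up values (not taken from the pumpkin data) that computes it by hand next to the RMSE:

```r
# Toy values for illustration only
truth    <- c(450, 300, 520, 410, 380)
estimate <- c(440, 320, 500, 430, 370)

rsq_manual  <- cor(truth, estimate)^2            # squared correlation, as in yardstick::rsq()
rmse_manual <- sqrt(mean((truth - estimate)^2))  # typical size of the error, in kg

c(rsq = rsq_manual, rmse = rmse_manual)
```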

---

#### Our Basic Example: The ol' reliable

Happy with that? Fit it on the entire training data provided and then make predictions for your final submission:

```r
final_model <- lin_wf %>% 
  finalize_workflow(select_best(lin_tune_results, metric = "rsq")) %>% 
* fit(training)
```

---

#### Our Basic Example: The ol' reliable

.scroll-output[

Make predictions and save the results as a _.csv_ file. Then submit your predictions to us and we will score them. You can submit as often as you like. Since we hold the target values, we'll tell you how you performed on the holdout.

```r
final_model %>% 
  augment(holdout) %>% 
  select(id, .pred) %>% 
  rename(weight_kg = .pred)
```

```
## # A tibble: 2,187 × 2
##       id weight_kg
##    <dbl>     <dbl>
##  1 10194      491.
##  2  9228      451.
##  3 10361      487.
##  4 10033      469.
##  5 10369      490.
##  6  8875      489.
##  7 10203      487.
##  8  9466      441.
##  9  9850      483.
## 10  9931      439.
## # … with 2,177 more rows
```
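
Saving the result could look like the sketch below. The file name and the toy data frame are placeholders chosen for illustration; in practice you would pass the tibble from the pipeline above to `write_csv()`.

```r
library(readr)

# Toy predictions standing in for the tibble printed above
toy_submission <- data.frame(id = c(10194, 9228), weight_kg = c(491.2, 450.8))

out_file <- file.path(tempdir(), "submission.csv")  # hypothetical file name
write_csv(toy_submission, out_file)

# Read the file back to confirm it has the expected two columns
read_csv(out_file, show_col_types = FALSE)
```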

]

---

#### Our Basic Example: The ol' reliable

.scroll-output[

```r
final_model %>% 
  augment(read_csv("./Data/holdout_with_target.csv", show_col_types = FALSE)) %>% 
  ggplot(aes(weight_kg, .pred)) +
  geom_point(alpha = 0.2) +
  geom_abline(lty = "dashed", colour = "red")
```

<img src="Giant-Pumpkins_files/figure-html/unnamed-chunk-24-1.png" width="100%" />
]

---

### You won't have the target variable on the holdout data set for two reasons

.pull-left[

.center[

**Reason 1:**

<br>

<img src="GraphicsSlides/roll safe.png" width="100%" />
]

]

.pull-right[

.center[

**Reason 2:**

<br>

<img src="GraphicsSlides/pumpkin spice.jpg" width="60%" />
]

]

There's something to win here - so play fair.

---

### Off you go

Have a look at our example from the Tidymodels tutoring session to see how to deal with splits and hyperparameter tuning. We'll be here to answer your questions as they come up. Copying and pasting the code in the slides is **allowed**!

Spend as long as you need modelling.

#### 🕒 20:00

---

# That's it for today!

After our session, watch videos from Julia Silge, Andrew Couch and David Robinson. Most importantly, have fun while learning.

For further questions, feel free to reach out to us. Stay up to date via our socials and our website, where all resources and dates are published.

<br>

.center[
<img src="GraphicsSlides/Logo RUG hell.png" width="60%" />

**[Website](https://rusergroup-sg.ch/) | [Instagram](https://www.instagram.com/rusergroupstgallen/?hl=en) | [Twitter](https://twitter.com/rusergroupsg)**

]

---

class: middle, inverse, hide-logo

# Thank you for attending!

<em style="color:#404040">The material provided in this presentation including any information, tools, features, content and any images incorporated in the presentation, is solely for your lawful, personal, private use. You may not modify, republish, or post anything you obtain from this presentation, including anything you download from our website, unless you first obtain our written consent. You may not engage in systematic retrieval of data or other content from this website. We request that you not create any kind of hyperlink from any other site to ours unless you first obtain our written permission.</em>